Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating their robustness against distribution shifts is crucial before adopting them in real-world applications. In this paper, we investigate the robustness of 9 popular open-sourced image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI and MOR) for proper evaluations of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models.
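As an illustration of what a character-level text perturbation can look like, the sketch below swaps adjacent letters at random. It is a minimal example under stated assumptions, not one of the 16 perturbation techniques as implemented in the paper: the function name `swap_adjacent_chars` and the `rate` and `seed` parameters are hypothetical.

```python
import random


def swap_adjacent_chars(text, rate=0.1, seed=0):
    """Randomly swap neighbouring letters (hypothetical perturbation).

    `rate` is the probability of swapping each eligible letter pair;
    `seed` makes the perturbation reproducible.
    """
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair just swapped so no letter moves twice
        else:
            i += 1
    return "".join(chars)


caption = "a man riding a wave on top of a surfboard"
perturbed = swap_adjacent_chars(caption, rate=0.3, seed=1)
```

The perturbed caption keeps the same characters and length as the original, so any drop in retrieval or captioning quality can be attributed to the spelling noise alone.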
translated by Google Translate
We address interactive panoptic annotation, where one segments all object and stuff regions in an image. We investigate two graph-based segmentation algorithms that both enforce connectivity of each region, with a notable class-aware Integer Linear Programming (ILP) formulation that ensures a global optimum. Both algorithms can take as input either RGB values or the feature maps from any DCNN, whether or not it was trained on the target dataset. We then propose an interactive, scribble-based annotation framework.
This technical report briefly describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard. SuperGLUE is more challenging than the widely used general language understanding evaluation (GLUE) benchmark, containing eight difficult language understanding tasks, including question answering, natural language inference, word sense disambiguation, coreference resolution, and reasoning. [Method] Instead of arbitrarily increasing the size of a pretrained language model (PLM), our aim is to 1) fully extract knowledge from the input pretraining data given a certain parameter budget, e.g., 6B, and 2) effectively transfer this knowledge to downstream tasks. To achieve goal 1), we propose self-evolution learning for PLMs to wisely predict the informative tokens that should be masked, and supervise the masked language modeling (MLM) process with rectified smooth labels. For goal 2), we leverage the prompt transfer technique to improve the low-resource tasks by transferring the knowledge from the foundation model and related downstream tasks to the target task. [Results] According to our submission record (Oct. 2022), with our optimized pretraining and fine-tuning strategies, our 6B Vega method achieved new state-of-the-art performance on 4/8 tasks, sitting atop the SuperGLUE leaderboard on Oct. 8, 2022, with an average score of 91.3.
In recent years, multi-scale generative adversarial networks (GANs) have been proposed to build generalized image processing models based on a single sample. Constrained by the sample size, multi-scale GANs have great difficulty converging to the global optimum, which ultimately limits their capabilities. In this paper, we pioneer the introduction of PAC-Bayes generalization bound theory into the training analysis of specific models under different adversarial training methods, which yields a non-vacuous upper bound on the generalization error for the specified multi-scale GAN structure. Based on the drastic changes we observed in the generalization error bound under different adversarial attacks and different training states, we propose an adaptive training method that greatly improves the image manipulation ability of multi-scale GANs. Experimental results show that our adaptive training method substantially improves the quality of the images generated by multi-scale GANs on several image manipulation tasks. In particular, for the image super-resolution restoration task, the multi-scale GAN model trained by the proposed method achieves a 100% reduction in natural image quality evaluator (NIQE) and a 60% reduction in root mean squared error (RMSE), which is better than many models trained on large-scale datasets.
While many systems have been developed to train Graph Neural Networks (GNNs), efficient model inference and evaluation remain to be addressed. For instance, using the widely adopted node-wise approach, model evaluation can account for up to 94% of the time in the end-to-end training process due to neighbor explosion, which means that a node accesses its multi-hop neighbors. On the other hand, layer-wise inference avoids the neighbor explosion problem by conducting inference layer by layer such that the nodes only need their one-hop neighbors in each layer. However, implementing layer-wise inference requires substantial engineering efforts because users need to manually decompose a GNN model into layers for computation and split workload into batches to fit into device memory. In this paper, we develop Deep Graph Inference (DGI) -- a system for easy and efficient GNN model inference, which automatically translates the training code of a GNN model for layer-wise execution. DGI is general for various GNN models and different kinds of inference requests, and supports out-of-core execution on large graphs that cannot fit in CPU memory. Experimental results show that DGI consistently outperforms layer-wise inference across different datasets and hardware settings, and the speedup can be over 1,000x.
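The difference between node-wise and layer-wise inference can be sketched on a toy graph. The code below is a minimal illustration, not DGI's implementation: it assumes a weight-free mean-aggregation layer, an adjacency list as the graph format, and hypothetical function names. The node-wise version recurses into multi-hop neighbors (the source of neighbor explosion), while the layer-wise version computes every node's embedding one layer at a time, in batches, using only one-hop neighbors.

```python
import numpy as np


def node_wise_embedding(adj_list, feats, node, num_layers):
    """Embedding of one node, recursing into its multi-hop neighborhood."""
    if num_layers == 0:
        return np.asarray(feats[node], dtype=float)
    members = [node] + list(adj_list[node])
    # Each extra layer multiplies the number of recursive calls.
    return np.mean(
        [node_wise_embedding(adj_list, feats, n, num_layers - 1)
         for n in members],
        axis=0,
    )


def layer_wise_inference(adj_list, feats, num_layers, batch_size):
    """Embeddings of all nodes, computed layer by layer in batches."""
    h = np.asarray(feats, dtype=float)
    num_nodes = len(adj_list)
    for _ in range(num_layers):
        nxt = np.empty_like(h)
        for start in range(0, num_nodes, batch_size):
            stop = min(start + batch_size, num_nodes)
            for node in range(start, stop):
                # Only one-hop neighbors from the previous layer are read.
                members = [node] + list(adj_list[node])
                nxt[node] = h[members].mean(axis=0)
        h = nxt
    return h


# Toy path graph 0 - 1 - 2 - 3 with one scalar feature per node.
adj = [[1], [0, 2], [1, 3], [2]]
x = np.array([[0.0], [1.0], [2.0], [3.0]])
layer_out = layer_wise_inference(adj, x, num_layers=2, batch_size=2)
node_out = np.stack([node_wise_embedding(adj, x, n, 2) for n in range(4)])
```

Both routines produce the same embeddings; the layer-wise one reuses each intermediate result instead of recomputing it for every node that depends on it.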
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
Persona-based dialogue systems aim to generate consistent responses based on historical context and a predefined persona. Unlike conventional dialogue generation, persona-based dialogue needs to consider both dialogue context and persona, posing a challenge for coherent training. Specifically, this requires a delicate weight balance between context and persona. To achieve that, we propose an effective framework with Persona-Adaptive Attention (PAA), which adaptively integrates the weights from the persona and context information via our designed attention. In addition, a dynamic masking mechanism is applied to the PAA, both to drop redundant information in context and persona and to serve as a regularization mechanism that avoids overfitting. Experimental results demonstrate the superiority of the proposed PAA framework over strong baselines in both automatic and human evaluation. Moreover, the proposed PAA approach performs comparably well in a low-resource regime: with only 20% to 30% of the data, it achieves results similar to those of larger models trained in the full-data setting. To fully exploit the effectiveness of our design, we designed several variants that handle the weighted information in different ways, showing the necessity and sufficiency of our weighting and masking designs.
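In this paper, we present a novel model for simultaneous, stable co-saliency detection (CoSOD) and object co-segmentation (CoSEG). To detect co-saliency (or co-segment objects) accurately, the core problem is to model the inter-image relations within an image group well. Some methods design sophisticated modules, such as recurrent neural networks (RNNs), to address this problem. However, order sensitivity is the major drawback of RNNs, and it heavily affects the stability of the proposed CoSOD (CoSEG) models. In this paper, inspired by RNN-based models, we first propose a multi-path stable recurrent unit (MSRU), which contains a dummy orders mechanism (DOM) and a recurrent unit (RU). Our proposed MSRU not only helps the CoSOD (CoSEG) model capture robust inter-image relations, but also reduces order sensitivity, resulting in more stable inference and training. Moreover, we design a cross-order contrastive loss (COCL) that further addresses the order-sensitivity problem by pulling together the feature embeddings generated from different input orders. We validate our model on five widely used CoSOD datasets (CoCA, CoSOD3k, Cosal2015, iCoseg, and MSRC) and three widely used object co-segmentation datasets (Internet, iCoseg, and PASCAL-VOC). The results demonstrate the superiority of the proposed approach over state-of-the-art (SOTA) methods.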
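In contrast to the traditional avatar creation pipeline, which is a costly process, contemporary generative approaches learn the data distribution directly from photographs, and the state of the art can now yield highly photo-realistic images. While plenty of works attempt to extend unconditional generative models and achieve some level of controllability, it is still challenging to ensure multi-view consistency, especially in large poses. In this work, we propose a 3D portrait generation network that produces 3D-consistent portraits while being controllable according to semantic parameters regarding pose, identity, expression, and lighting. The generative network uses a neural scene representation to model portraits in 3D, and its generation is guided by a parametric face model that supports explicit control. While the latent disentanglement can be further enhanced by contrasting images with partially different attributes, noticeable inconsistency still exists in non-face regions when animating expressions. We solve this by proposing a volume blending strategy in which we form a composite output by blending the dynamic and static radiance fields, with the two parts segmented from the jointly learned semantic field. Our method outperforms prior art in extensive experiments, producing realistic portraits in natural lighting when viewed from free viewpoints. The proposed method also demonstrates generalization ability to real images as well as out-of-domain cartoon faces, showing great promise in real applications. Additional video results and code will be available on the project webpage.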
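Self-distillation exploits non-uniform soft supervision from the network itself during training and improves performance without any runtime cost. However, the overhead during training is often overlooked, even though reducing time and memory overhead during training is increasingly important in the era of giant models. This paper proposes an efficient self-distillation method named Zipf's Label Smoothing (Zipf's LS), which uses the on-the-fly predictions of a network to generate soft supervision that conforms to a Zipf distribution, without using any contrastive samples or auxiliary parameters. Our idea comes from an empirical observation: when a network is duly trained, its output values, after being sorted by magnitude and averaged across samples, should follow a distribution reminiscent of Zipf's law in the word frequency statistics of natural languages. By enforcing this property at the sample level and throughout the whole training period, we find that predictive accuracy can be greatly improved. Using ResNet50 on the INAT21 fine-grained classification dataset, our technique achieves a +3.61% accuracy gain over the vanilla baseline, and a 0.88% larger gain than previous label smoothing or self-distillation strategies. The implementation is publicly available at https://github.com/megvii-research/zipfls.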
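To make the idea of Zipf-shaped soft supervision concrete, the sketch below builds a soft-label vector whose sorted values follow a power law over ranks. It is a simplified illustration and not the authors' implementation (see the repository above for that): the function name `zipf_soft_labels`, the exponent `alpha`, and the rule that the ground-truth class always takes rank 1 are assumptions made here.

```python
import numpy as np


def zipf_soft_labels(logits, target, alpha=1.0):
    """Soft labels proportional to 1 / rank**alpha (hypothetical sketch).

    Classes are ranked by the network's own prediction, highest first,
    except that the ground-truth class `target` is forced into rank 1.
    """
    logits = np.asarray(logits, dtype=float)
    num_classes = logits.shape[0]
    order = np.argsort(-logits, kind="stable")
    order = np.concatenate(([target], order[order != target]))
    ranks = np.arange(1, num_classes + 1)
    zipf = 1.0 / ranks ** alpha
    zipf /= zipf.sum()  # normalize so the labels form a distribution
    soft = np.empty(num_classes)
    soft[order] = zipf
    return soft


# Class 1 is the ground truth even though the network prefers class 0.
soft = zipf_soft_labels([2.0, 0.5, 1.0], target=1)
```

In this toy case the ranks are class 1, class 0, class 2, so the soft labels are 6/11, 3/11, and 2/11 respectively. Such a vector could replace the one-hot target in a cross-entropy or KL-divergence loss.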